Supervised Machine Learning

Of the two pre-eminent machine learning paradigms -- supervised and unsupervised -- classification is a popular supervised method, where examples labeled by humans guide the training of a machine. Below, we introduce classification with a few hands-on examples.

Agenda

Public Datasets

There are numerous public ML datasets available for exploration, contributed by many commercial and academic organizations. A few examples below.

Models & Notebooks

There are even more open-source contributions of prebuilt models (some packaged as notebooks). Here are a few examples --

Pandas, Scikit, BQML, AutoML

Previously, with BQML, we developed an in-database classification model directly in the BigQuery data warehouse, where the continuous-training and continuous-scoring methods are fully managed, opaque, and seamless to consumers.

-- Jump to https://console.cloud.google.com/bigquery?project=project-dynamic-modeling&p=project-dynamic-modeling 
-- and key in the model as follows
CREATE OR REPLACE MODEL
 `bqml_tutorial.cardio_logistic_model` OPTIONS
   (model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    input_label_cols=['cardio']) AS
  SELECT age, gender, height, weight, ap_hi,
    ap_lo, cholesterol, gluc, smoke,
    alco, active, cardio
  FROM `project-dynamic-modeling.cardio_disease.cardio_disease`

There is also a managed service in Google Cloud Platform (GCP) -- called AutoML Tables -- which provides a totally seamless experience for citizen data science.

automltables.png

Today, we focus on the middle ground: building the classification model from scratch. Specifically, we will use Google Colab (a freemium Jupyter notebook service for Julia, Python, and R) to ingest, shape, explore, visualize, and model data.

There is also a managed JupyterHub environment offered by Google (called AI Notebooks) that we will utilize later.

Garfield TVitcharoo Trivia

Garfield lies down in the evening to watch TV. I think his biometrics and activity choices during the day are indicative of his TV propensity at night. Do you see a pattern in this data?

Pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It was developed by Wes McKinney in 2008.

Pandas can, of course, be used to slice, dice, and describe data. Traditional sorting, filtering, grouping, and transforms work too.
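
As a quick sketch -- using a hypothetical toy frame standing in for Garfield's daily log (the column names are made up for illustration) -- slicing, filtering, grouping, and sorting look like:

```python
import pandas as pd

# Hypothetical stand-in for Garfield's daily activity log
df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "naps": [3, 5, 2, 4, 6],
    "lasagna_servings": [2, 1, 3, 2, 4],
    "watched_tv": [1, 1, 0, 1, 1],
})

lazy_days = df[df["naps"] >= 4]                    # filter rows with a boolean mask
by_tv = df.groupby("watched_tv")["naps"].mean()    # group and aggregate
sorted_df = df.sort_values("lasagna_servings", ascending=False)  # sort
```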

Descriptive Statistics

Let us describe the distributions of the data (skew, mean, mode, min, max)
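
For instance, `describe()` and the individual statistic methods report these in one call each (the toy numbers below are illustrative):

```python
import pandas as pd

weights = pd.Series([11.2, 11.5, 10.9, 12.0, 11.8], name="weight")

stats = weights.describe()   # count, mean, std, min, quartiles, max
skewness = weights.skew()    # asymmetry of the distribution
mode = weights.mode()        # most frequent value(s)
```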

Visualizing the data distributions

Training, Testing, Scoring Datasets


Labeled Data

Only consider the "labeled" data: data that has been "supervised" by human intelligence. Notice that our toy data has missing labels in the last two rows. We will use this data as "scoring" data.

Factor Plots

Quickly see correlations between the numerical attributes and the output attribute

Data Types

Attributes -- independent variables (that presumably determine prediction)

  1. Numerical Attributes -- Independent variables in the study, usually represented as real numbers.
  2. Temporal Attributes -- Time variable: for example, date fields. Span/aging factors can be derived.
  3. Spatial Attributes -- Location variable: for example, latitude and longitude. Distance factors can be derived.
  4. Ordinal Attributes -- Numerical or text variables that imply ordering. For example, low, medium, high can be encoded as 1, 2, 3 respectively.
  5. Categorical Attributes -- String variables: usually do not imply any ordinality (ordering) but have small cardinality. For example, Male-Female, Winter-Spring-Summer-Fall
  6. Text Attributes -- String variables that usually have very high cardinality. For example, user reviews with commentary
  7. ID Attributes -- Identity attributes (usually string/long numbers) that have no significance in predicting outcome. For example, social security number, warehouse id. It is best to avoid these ID attributes in the modeling exercise.
  8. Leakage Attributes -- redundant attributes that are deterministically correlated with the outcome label attribute. For example, say we have two temperature attributes -- one in Fahrenheit and one in Celsius -- where the Fahrenheit temperature is the predicted attribute; accidentally including the Celsius attribute in the modeling will lead to perfect predictions that fail to capture the true stochasticity of the process.

Labels

  1. Categorical Labels -- Usually a string or ordinal variable with small cardinality. For example, asymptomatic recovery, symptomatic recovery, intensive care recovery, fatal. This usually indicates a classification problem.
  2. Numerical Labels -- Usually a numerical output variable. For example, business travel volume. This usually indicates a regression problem.
  3. When labels do not exist in the dataset, it usually indicates an unsupervised learning problem.

Data Imputations

Impute missing values with the mean, interpolation, forward-fill, or backward-fill, or drop the incomplete rows altogether.
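
Each strategy is one line in pandas; a sketch on a toy series with gaps:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mean_filled = s.fillna(s.mean())   # replace gaps with the column mean
forward = s.ffill()                # carry the last observation forward
backward = s.bfill()               # pull the next observation backward
interpolated = s.interpolate()     # linear interpolation between neighbors
dropped = s.dropna()               # or drop the incomplete rows altogether
```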

Data Shaping

Pivot, transpose, or interpolate data
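
A sketch of pivoting long records into a wide matrix and then transposing it (hypothetical columns):

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "state": ["NY", "CA", "NY", "CA"],
    "cases": [10, 20, 30, 40],
})

wide = long.pivot(index="date", columns="state", values="cases")  # date x state matrix
flipped = wide.T                                                  # transpose: states become rows
```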

Converting to Numerical Matrix

Exclude ID attributes and leakage attributes, and include only numerical, temporal, spatial, ordinal, and categorical attributes. Also encode labels accordingly.
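
A minimal sketch with a hypothetical frame: drop the ID column, one-hot encode the categorical attribute, and integer-encode the label:

```python
import pandas as pd

df = pd.DataFrame({
    "pet_id": ["a1", "a2", "a3"],              # ID attribute: excluded
    "season": ["Winter", "Summer", "Winter"],  # categorical: one-hot encoded
    "weight": [11.2, 12.0, 10.9],              # numerical: kept as-is
    "label": ["Dog", "Cat", "Dog"],            # target: encoded to integers
})

X = pd.get_dummies(df.drop(columns=["pet_id", "label"]), columns=["season"])
y = df["label"].astype("category").cat.codes   # Cat -> 0, Dog -> 1 (alphabetical)
```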

Leading Indicators

Quickly compute correlation coefficients to determine whether moving one attribute has any bearing on the output.

correlated_coefficients.png

uncorrelated.png

Simple Pearson Correlation

Compute correlation vector between label and input attributes (direction & magnitude of change)
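
With pandas this is one call; a sketch on made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    "naps": [1, 2, 3, 4, 5],
    "noise": [5, 1, 4, 2, 3],
    "watched_tv": [0, 0, 1, 1, 1],
})

# Pearson correlation of each input attribute with the label column:
# the sign gives direction of change, the magnitude gives strength
corr_with_label = df.corr(method="pearson")["watched_tv"].drop("watched_tv")
```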

Heart Disease Data

Remember we picked the heart disease data for MVM.

What do we know?

What leading indicators can be gleaned to predict the cardio disease?

COVID-19

Johns Hopkins makes COVID data available daily. We visualize cluster heatmaps of COVID in the US over the last six months.

Use inline bash magic to download daily CSV data.

Collate

Collate the data temporally and compute active, tested, confirmed, and recovered cases in the US.

Ensure we roll up the daily data to weekly and align it to Monday.
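
One way to sketch that alignment in pandas (toy daily counts standing in for the COVID feed):

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2020-06-01", periods=14, freq="D"),  # starts on a Monday
    "cases": range(14),
})

# Map every date to the Monday that starts its week, then total per week
daily["week_of"] = daily["date"] - pd.to_timedelta(daily["date"].dt.weekday, unit="D")
weekly = daily.groupby("week_of")["cases"].sum()
```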

Plot COVID Geomap

Animate Weekly Progression

First Garfield Model

Model the Garfield TVitcharoo Bot using simple logistic regression. Logistic regression is similar to linear regression, but instead of predicting a continuous output, it classifies training examples into a set of categories or labels. For example, linear regression on a set of electoral surveys might be used to predict a candidate's electoral vote count, while logistic regression could be used to predict the president-elect. Logistic regression predicts classes, not numeric magnitudes. It can easily be extended to multiclass problems where there are more than two label categories.
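
A minimal scikit-learn sketch; the features and labels below are made up to stand in for Garfield's day:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [naps, lasagna_servings]; label 1 = watched TV
X = np.array([[1, 0], [2, 1], [3, 1], [6, 3], [7, 4], [8, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
prediction = clf.predict([[7, 3]])   # score a new, unseen day
```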

Use Model to Predict

Explaining the Model

Can we explain the model? Logistic regression is essentially a linear regression whose continuous output is mapped into a category via the logistic curve.

Logistic Function{width=50%}
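
The logistic (sigmoid) function itself is one line of code:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

midpoint = sigmoid(0.0)   # 0.5: the classification boundary
```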

Naive Bayes Model

Naive Bayes classifiers are built on Bayesian classification methods. These rely on Bayes's theorem, which is an equation describing the relationship of conditional probabilities of statistical quantities. In Bayesian classification, we're interested in finding the probability of a label given some observed features, which we can write as $P(L~|~{\rm features})$. Bayes's theorem tells us how to express this in terms of quantities we can compute more directly:

$$ P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})} $$

If we are trying to decide between two labels—let's call them $L_1$ and $L_2$—then one way to make this decision is to compute the ratio of the posterior probabilities for each label:

$$ \frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})} = \frac{P({\rm features}~|~L_1)}{P({\rm features}~|~L_2)}\frac{P(L_1)}{P(L_2)} $$

All we need now is some model by which we can compute $P({\rm features}~|~L_i)$ for each label. Such a model is called a generative model because it specifies the hypothetical random process that generates the data. Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier. The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.

This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification. Different types of naive Bayes classifiers rest on different naive assumptions about the data, and we will examine a few of these in the following sections.
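
A Gaussian naive Bayes sketch in scikit-learn, on made-up one-feature data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fits one Gaussian per class per feature as the generative model
nb = GaussianNB().fit(X, y)
posterior = nb.predict_proba([[4.9]])   # P(L | features) for each label
```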

Standard Scaling

What went wrong? Remember that Bayesian models make assumptions about prior probabilities. In our case, we assumed our data followed a Gaussian distribution, but the one-hot encoding produces 0/1 (bimodal) features. The NB classifier is a parametric model.
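
Standard scaling centers each feature and rescales it to unit variance; a sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)   # each column now has mean 0, variance 1
```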

K Nearest Neighbor

The principle behind nearest neighbor methods is to find a predefined number of training samples (K) closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning).
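
A K = 3 sketch with scikit-learn on toy points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Predict by majority vote among the 3 nearest training points
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
prediction = knn.predict([[5, 4]])
```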

Decision Trees, Random Forest Methods

Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification, much like the 20 questions game where each response can only be yes/no. Random forests are an example of an ensemble learner built on decision trees. Ensemble methods rely on aggregating the results of an ensemble of simpler estimators. The somewhat surprising result with such ensemble methods is that the sum can be greater than the parts: that is, a majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting.

Visualizing the Decision Tree

We have randomized the data by fitting each estimator with a random subset of 70% of the training points. In practice, decision trees are more effectively randomized by injecting some stochasticity in how the splits are chosen: this way all the data contributes to the fit each time, but the results of the fit still have the desired randomness. In Scikit-Learn, an optimized ensemble of randomized decision trees is implemented in the RandomForestClassifier estimator, which takes care of all the randomization automatically.
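
A RandomForestClassifier sketch on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# An ensemble of randomized decision trees voting on each prediction
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
prediction = forest.predict([[6, 6]])
```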

Gradient Boosted Methods

Bagging -- bootstrap aggregation -- trains each estimator on points sampled randomly with replacement. When we deliberately select some observations more often than others because their classes are difficult to separate (and reward the trees that handle them better), we are applying boosting. Boosting works by weighting the observations, putting more weight on difficult-to-classify instances and less on those already handled well. New weak learners are added sequentially that focus their training on the more difficult patterns. This means that samples which are difficult to classify receive increasingly larger weights until the algorithm identifies a model that correctly classifies them.
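
AdaBoost is the classic reweighting booster described above; a scikit-learn sketch on toy one-dimensional data:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Sequentially adds weak learners, upweighting misclassified samples
booster = AdaBoostClassifier(n_estimators=30, random_state=0).fit(X, y)
prediction = booster.predict([[4.5]])
```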

XGBoost

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. XGBoost stands for eXtreme Gradient Boosting.

Support Vector Machines

Consider the simple case of a classification task in which the two classes of points are well separated. While presumably any line that separates the points is decent enough, the dividing line that maximizes the margin to the closest points of each class is perhaps the best. Notice that a few of the training points just touch the margin: they are indicated by the black circles in this figure. These points are the pivotal elements of the fit; they are known as the support vectors and give the algorithm its name. Support Vector Machines (SVMs) are parametric classification methods.
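
A linear SVM sketch; the fitted `support_vectors_` are exactly the points that touch the margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X, y)
n_support = len(svm.support_vectors_)   # the margin-defining points
prediction = svm.predict([[5, 4]])
```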

Model Selection

So which of the models should be selected? How do we know which is better if they all yield different results? Three points --

  1. Non-parametric methods do not assume underlying distributions, so they work better for categorical and ordinal variables.
  2. Scikit-Learn -- even for decision trees, which (theoretically) support categorical variables -- requires that we one-hot encode input features and output labels. Since that effort is intrinsic, it is best to let accuracy dictate model choice.
  3. Intuition and experience -- let palatability (passes your sniff test) and explainability (can be logically articulated to others) guide the choice.

Accuracy

Precision

The ratio of correct positive predictions to the total predicted positives. Precision = TP / (TP + FP)

Recall

The ratio of correct positive predictions to the total positive examples. Recall = TP / (TP + FN)

Accuracy

Accuracy is defined as the ratio of correctly predicted examples to the total examples. Accuracy = (TP + TN) / (TP + FP + FN + TN)

F1-Score

F1 score is the harmonic mean of precision and recall; it therefore takes both false positives and false negatives into account. F1 Score = 2 x (Recall x Precision) / (Recall + Precision)
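
All four metrics are one call each in scikit-learn; the toy predictions below are chosen so that TP = 3, FN = 1, FP = 1, TN = 3:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # TP=3, FN=1, FP=1, TN=3

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```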

ROC Curve

A ROC curve (receiver operating characteristic curve) graphs the performance of a classification model at all classification thresholds. In binary classification we normally choose 0.5 as the decision threshold; the ROC curve shows how predictions flip as that threshold is swept from 0 to 1.
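
scikit-learn computes the full curve from predicted scores; a sketch on four toy predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probabilities

# False-positive rate vs true-positive rate at every threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)        # area under that curve
```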

Cardio Model

Let us use the cardio model to verify accuracy using the testing dataset. See the definition of this dataset on Kaggle

Withhold Test Set for Accuracy

Build the model

Report Accuracy

Understanding ROC Curve

EMNIST

The EMNIST dataset is a set of handwritten character digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset. Let us use this dataset to train a model that detects characters scribbled on a piece of white paper.

Train the Keras Model

Test on Handwriting

We achieved great accuracy on the existing MNIST sample. But if we apply it to new handwriting -- Seshu's, say -- does it perform in the real world? See my scribble